System and method for conversational agent via adaptive caching of dialogue tree

ABSTRACT

The present teaching relates to a method, system, medium, and implementations for managing a user machine dialogue. Sensor data is received at a device, including an utterance representing a speech of a user engaged in a dialogue with the device. The speech of the user is determined based on the utterance, and a local dialogue manager residing on the device searches a sub-dialogue tree stored on the device for a response to the user. The response, if identified from the sub-dialogue tree, is rendered to the user in response to the speech. If the response is not available in the sub-dialogue tree, a request is sent to a server for the response.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/277,271, filed Feb. 15, 2019, which claims priority to U.S. Provisional Patent Application 62/630,983, filed Feb. 15, 2018, the contents of which are hereby incorporated by reference in their entireties.

The present application is related to International Application PCT/US2019/018217, filed Feb. 15, 2019, International Application PCT/US2019/018226, filed Feb. 15, 2019, U.S. patent application Ser. No. 16/277,301, filed Feb. 15, 2019, International Application PCT/US2019/018235, filed Feb. 15, 2019, U.S. patent application Ser. No. 16/277,337, filed Feb. 15, 2019, International Application PCT/US2019/018242, filed Feb. 15, 2019, U.S. patent application Ser. No. 16/277,381, filed Feb. 15, 2019, International Application PCT/US2019/018248, filed Feb. 15, 2019, and U.S. patent application Ser. No. 16/277,418, filed Feb. 15, 2019, the contents of which are hereby incorporated by reference in their entireties.

BACKGROUND

1. Technical Field

The present teaching generally relates to computers. More specifically, the present teaching relates to a computerized dialogue agent.

2. Technical Background

With the advancement of artificial intelligence technologies and the explosion of Internet based communications enabled by the Internet's ubiquitous connectivity, computer aided dialogue systems have become increasingly popular. For example, more and more call centers deploy automated dialogue robots to handle customer calls. Hotels have started to install various kiosks that can answer questions from tourists or guests. Online bookings (whether travel accommodations or theater tickets, etc.) are also more frequently done by chatbots. In recent years, automated human machine communications in other areas are also becoming more and more popular.

Such traditional computer aided dialogue systems are usually pre-programmed with certain questions and answers based on commonly known patterns of conversations in different domains. Unfortunately, a human conversant can be unpredictable and sometimes does not follow a pre-planned dialogue pattern. In addition, in certain situations, a human conversant may digress during the process, and continuing the fixed conversation pattern will likely cause irritation or loss of interest. When this happens, such traditional machine dialogue systems often will not be able to continue to engage a human conversant, so that the human machine dialogue either has to be aborted to hand the task to a human operator or the human conversant simply leaves the dialogue, which is undesirable.

In addition, traditional machine based dialogue systems are often not designed to address the emotional factor of a human, let alone take into consideration how to address such an emotional factor when conversing with a human. For example, a traditional machine dialogue system usually does not initiate the conversation unless a human activates the system or asks some questions. Even if a traditional dialogue system does initiate a conversation, it has a fixed way to start a conversation that does not change from human to human or adjust based on observations. As such, although they are programmed to faithfully follow the pre-designed dialogue pattern, they are usually not able to act on the dynamics of the conversation and adapt in order to keep the conversation going in a way that can engage the human. In many situations, when a human involved in a dialogue is clearly annoyed or frustrated, a traditional machine dialogue system is completely unaware and continues the conversation in the same manner that has annoyed the human. This not only makes the conversation end unpleasantly (the machine is still unaware of that) but also turns the person away from conversing with any machine based dialogue system in the future.

In some applications, conducting a human machine dialogue session based on what is observed from the human is crucially important in order to determine how to proceed effectively. One example is an education related dialogue. When a chatbot is used for teaching a child to read, whether the child is receptive to the way he/she is being taught has to be monitored and addressed continuously in order to be effective. Another limitation of the traditional dialogue systems is their context unawareness. For example, a traditional dialogue system is not equipped with the ability to observe the context of a conversation and improvise as to dialogue strategy in order to engage a user and improve the user experience.

Thus, there is a need for methods and systems that address such limitations.

SUMMARY

The teachings disclosed herein relate to methods, systems, and programming for a computerized dialogue agent.

In one example, a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network, is disclosed for managing a user machine dialogue. Sensor data is received at a device, including an utterance representing a speech of a user engaged in a dialogue with the device. The speech of the user is determined based on the utterance, and a local dialogue manager residing on the device searches a sub-dialogue tree stored on the device for a response to the user. The response, if identified from the sub-dialogue tree, is rendered to the user in response to the speech. If the response is not available in the sub-dialogue tree, a request is sent to a server for the response.

In a different example, a system for managing a user machine dialogue is disclosed. The system includes a device comprising a sensor data analyzer, a surrounding information understanding unit, a local dialogue manager, a response rendering unit, and a device/server coordinator. The sensor data analyzer is configured for receiving sensor data including an utterance representing a speech of a user engaged in a dialogue with the device. The surrounding information understanding unit is configured for determining the speech of the user based on the utterance. The local dialogue manager resides on the device and is configured for searching a sub-dialogue tree stored on the device for a response to the user based on the speech. The response rendering unit is configured for rendering the response to the user in response to the speech, if the response is identified from the sub-dialogue tree. The device/server coordinator is configured for sending, if the response is not available in the sub-dialogue tree, a request to a server for the response.

Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

In one example, a machine-readable, non-transitory and tangible medium having data recorded thereon for managing a user machine dialogue is disclosed, wherein the medium, when read by the machine, causes the machine to perform a series of steps. Sensor data is received at a device, including an utterance representing a speech of a user engaged in a dialogue with the device. The speech of the user is determined based on the utterance, and a local dialogue manager residing on the device searches a sub-dialogue tree stored on the device for a response to the user. The response, if identified from the sub-dialogue tree, is rendered to the user in response to the speech. If the response is not available in the sub-dialogue tree, a request is sent to a server for the response.
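
For illustration only, the following Python sketch shows one way the device-side flow summarized above could be organized: a local dialogue manager keeps a cached sub-dialogue tree, answers from it when it can, and otherwise asks the server through a device/server coordinator. All class and method names (DialogueNode, LocalDialogueManager, request_response) are hypothetical and not part of the claimed implementation.

    from dataclasses import dataclass, field
    from typing import Dict


    @dataclass
    class DialogueNode:
        response: str  # response rendered at this node
        branches: Dict[str, "DialogueNode"] = field(default_factory=dict)  # keyed by classified user speech


    class LocalDialogueManager:
        """Answers from a cached sub-dialogue tree; falls back to the server on a miss."""

        def __init__(self, sub_tree: DialogueNode, coordinator):
            self.current = sub_tree          # root of the sub-dialogue tree stored on the device
            self.coordinator = coordinator   # stand-in for the device/server coordinator

        def respond(self, speech: str) -> str:
            nxt = self.current.branches.get(speech)
            if nxt is not None:              # response identified from the sub-dialogue tree
                self.current = nxt
                return nxt.response
            # response not available locally: send a request to the server
            reply, new_sub_tree = self.coordinator.request_response(speech)
            if new_sub_tree is not None:     # the server may also return an updated sub-tree to cache
                self.current = new_sub_tree
            return reply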

Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 depicts a networked environment for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching;

FIGS. 2A-2B depict connections among a user device, an agent device, and a user interaction engine during a dialogue, in accordance with an embodiment of the present teaching;

FIG. 3A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching;

FIG. 3B illustrates an exemplary agent device, in accordance with an embodiment of the present teaching;

FIG. 4A depicts an exemplary high level system diagram for an overall system for the automated companion, according to various embodiments of the present teaching;

FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching;

FIG. 4C illustrates an exemplary human-agent device interaction and exemplary processing performed by the automated companion, according to an embodiment of the present teaching;

FIG. 5 illustrates exemplary multiple layer processing and communications among different processing layers of an automated dialogue companion, according to an embodiment of the present teaching;

FIG. 6 depicts an exemplary high level system framework for an artificial intelligence based educational companion, according to an embodiment of the present teaching;

FIG. 7 illustrates a device-server configuration of a human machine dialogue system;

FIG. 8 depicts an exemplary framework directed to human machine dialogue management, according to an embodiment of the present teaching;

FIG. 9 depicts an exemplary high level system diagram of a device for human machine dialogue management, according to an embodiment of the present teaching;

FIG. 10 is a flowchart of an exemplary process of a device for human machine dialogue management, according to an embodiment of the present teaching;

FIG. 11 depicts an exemplary system diagram of a server for human machine dialogue management, according to an embodiment of the present teaching;

FIG. 12 is a flowchart of an exemplary process of a server for human machine dialogue management, according to an embodiment of the present teaching;

FIG. 13 depicts an exemplary system diagram of a server device configuration for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 14 depicts an exemplary system diagram of a server for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 15 is a flowchart of an exemplary process of a server for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 16 depicts a different exemplary system diagram of a server device configuration for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 17 depicts an exemplary system diagram of a device for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 18 is a flowchart of an exemplary process of a device for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 19 depicts an exemplary system diagram of a server for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 20 is a flowchart of an exemplary process of a server for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 21 depicts yet another different exemplary system diagram of a server device configuration for human machine dialogue management via preemptively generated dialogue content, according to an embodiment of the present teaching;

FIG. 22 depicts an exemplary system diagram of a server for human machine dialogue management via preemptively generated dialogue content, according to a different embodiment of the present teaching;

FIG. 23 is a flowchart of an exemplary process of a server for human machine dialogue management via preemptively generated dialogue content, according to a different embodiment of the present teaching;

FIG. 24 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments; and

FIG. 25 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing the present teaching in accordance with various embodiments.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth by way of examples in order to facilitate a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or circuitry have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enable a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine, so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimate the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.

The automated companion according to the present teaching is capable of personalizing a dialogue by adapting on multiple fronts, including, but not limited to, the subject matter of the conversation, the hardware/components used to carry out the conversation, and the expression/behavior/gesture used to deliver responses to a human conversant. The adaptive control strategy is to make the conversation more realistic and productive by flexibly changing the conversation strategy based on observations of how receptive the human conversant is to the dialogue. The dialogue system according to the present teaching can be configured to achieve a goal driven strategy, including dynamically configuring hardware/software components that are considered most appropriate to achieve an intended goal. Such optimizations are carried out based on learning, including learning from prior conversations as well as from an on-going conversation, by continuously assessing a human conversant's behavior/reactions during the conversation with respect to some intended goals. Paths exploited to achieve a goal driven strategy may be determined so as to keep the human conversant engaged in the conversation, even though, in some instances, paths at some moments of time may appear to be deviating from the intended goal.

More specifically, the present teaching discloses a user interaction engine providing backbone support to an agent device to facilitate more realistic and more engaging dialogues with a human conversant. FIG. 1 depicts a networked environment 100 for facilitating a dialogue between a user operating a user device and an agent device in conjunction with a user interaction engine, in accordance with an embodiment of the present teaching. In FIG. 1, the exemplary networked environment 100 includes one or more user devices 110, such as user devices 110-a, 110-b, 110-c, and 110-d, one or more agent devices 160, such as agent devices 160-a, . . . 160-b, a user interaction engine 140, and a user information database 130, each of which may communicate with one another via network 120. In some embodiments, network 120 may correspond to a single network or a combination of different networks. For example, network 120 may be a local area network (“LAN”), a wide area network (“WAN”), a public network, a proprietary network, a Public Telephone Switched Network (“PSTN”), the Internet, an intranet, a Bluetooth network, a wireless network, a virtual network, and/or any combination thereof. In one embodiment, network 120 may also include various network access points. For example, environment 100 may include wired or wireless access points such as, without limitation, base stations or Internet exchange points 120-a, . . . , 120-b. Base stations 120-a and 120-b may facilitate, for example, communications to/from user devices 110 and/or agent devices 160 with one or more other components in the networked framework 100 across different types of networks.

A user device, e.g., 110-a, may be of different types to facilitate a user operating the user device to connect to network 120 and transmit/receive signals. Such a user device 110 may correspond to any suitable type of electronic/computing device including, but not limited to, a desktop computer (110-d), a mobile device (110-a), a device incorporated in a transportation vehicle (110-b), . . . , a mobile computer (110-c), or a stationary device/computer (110-d). A mobile device may include, but is not limited to, a mobile phone, a smart phone, a personal display device, a personal digital assistant (“PDA”), a gaming console/device, a wearable device such as a watch, a Fitbit, a pin/broach, a headphone, etc. A transportation vehicle embedded with a device may include a car, a truck, a motorcycle, a boat, a ship, a train, or an airplane. A mobile computer may include a laptop, an Ultrabook device, a handheld device, etc. A stationary device/computer may include a television, a set top box, a smart household device (e.g., a refrigerator, a microwave, a washer or a dryer, an electronic assistant, etc.), and/or a smart accessory (e.g., a light bulb, a light switch, an electrical picture frame, etc.).

An agent device, e.g., any of 160-a, . . . , 160-b, may correspond to one of different types of devices that may communicate with a user device and/or the user interaction engine 140. Each agent device, as described in greater detail below, may be viewed as an automated companion device that interfaces with a user with, e.g., the backbone support from the user interaction engine 140. An agent device as described herein may correspond to a robot which can be a game device, a toy device, a designated agent device such as a traveling agent or weather agent, etc. The agent device as disclosed herein is capable of facilitating and/or assisting in interactions with a user operating a user device. In doing so, an agent device may be configured as a robot capable of controlling some of its parts, via the backend support from the application server 130, for, e.g., making certain physical movements (such as of the head), exhibiting certain facial expressions (such as curved eyes for a smile), or saying things in a certain voice or tone (such as exciting tones) to display certain emotions.

When a user device (e.g., user device 110-a) is connected to an agent device, e.g., 160-a (e.g., via either a contact or contactless connection), a client running on a user device, e.g., 110-a, may communicate with the automated companion (either the agent device or the user interaction engine or both) to enable an interactive dialogue between the user operating the user device and the agent device. The client may act independently in some tasks or may be controlled remotely by the agent device or the user interaction engine 140. For example, to respond to a question from a user, the agent device or the user interaction engine 140 may control the client running on the user device to render the speech of the response to the user. During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion in developing an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., the user picks up a ball, which may indicate that the user is bored) in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaged.

In the illustrated embodiments, the user interaction engine 140 may be a backend server, which may be centralized or distributed. It is connected to the agent devices and/or user devices. It may be configured to provide backbone support to agent devices 160 and guide the agent devices to conduct conversations in a personalized and customized manner. In some embodiments, the user interaction engine 140 may receive information from connected devices (either agent devices or user devices), analyze such information, and control the flow of the conversations by sending instructions to agent devices and/or user devices. In some embodiments, the user interaction engine 140 may also communicate directly with user devices, e.g., providing dynamic data, e.g., control signals for a client running on a user device to render certain responses.

Generally speaking, the user interaction engine 140 may control the state and the flow of conversations between users and agent devices. The flow of each of the conversations may be controlled based on different types of information associated with the conversation, e.g., information about the user engaged in the conversation (e.g., from the user information database 130), the conversation history, surrounding information of the conversations, and/or the real time user feedback. In some embodiments, the user interaction engine 140 may be configured to obtain various sensory inputs such as, and without limitation, audio inputs, image inputs, haptic inputs, and/or contextual inputs, process these inputs, formulate an understanding of the human conversant, accordingly generate a response based on such understanding, and control the agent device and/or the user device to carry out the conversation based on the response. As an illustrative example, the user interaction engine 140 may receive audio data representing an utterance from a user operating a user device, and generate a response (e.g., text) which may then be delivered to the user in the form of a computer generated utterance as a response to the user. As yet another example, the user interaction engine 140 may also, in response to the utterance, generate one or more instructions that control an agent device to perform a particular action or set of actions.
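
As a rough illustration of the control flow just described (not the actual implementation of the user interaction engine 140), the sketch below assumes an engine object exposing hypothetical helpers for speech recognition, emotion estimation, response generation, and action planning:

    def handle_user_turn(engine, audio, image, text, context):
        """One turn of the conversation as seen by the backend engine (illustrative)."""
        speech = engine.recognize_speech(audio)                # what the user said
        emotion = engine.estimate_emotion(audio, image, text)  # how the user appears to feel
        reply = engine.generate_response(speech, emotion, context)
        return {
            "utterance": reply,                                # delivered as a computer generated utterance
            "agent_actions": engine.plan_actions(emotion),     # e.g., instructions for gestures/expressions
        }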

As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, an audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., a text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.

FIG. 2A depicts specific connections among a user device 110-a, an agent device 160-a, and the user interaction engine 140 during a dialogue, in accordance with an embodiment of the present teaching. As seen, connections between any two of the parties may all be bi-directional, as discussed herein. The agent device 160-a may interface with the user via the user device 110-a to conduct a dialogue in a bi-directional communication. On one hand, the agent device 160-a may be controlled by the user interaction engine 140 to utter a response to the user operating the user device 110-a. On the other hand, inputs from the user site, including, e.g., both the user's utterance or action as well as information about the surrounding of the user, are provided to the agent device via the connections. The agent device 160-a may be configured to process such input and dynamically adjust its response to the user. For example, the agent device may be instructed by the user interaction engine 140 to render a tree on the user device. Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns, the agent device may customize the tree to be rendered as a lush green tree. If the scene from the user site shows that it is winter weather, the agent device may control the rendering of the tree on the user device with parameters for a tree that has no leaves. As another example, if the agent device is instructed to render a duck on the user device, the agent device may retrieve information from the user information database 130 on color preference and generate parameters for customizing the duck in the user's preferred color before sending the instruction for the rendering to the user device.
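
A minimal sketch of the kind of customization described above, assuming a simple dictionary of scene observations and stored user preferences (the function name, keys, and values are illustrative):

    def customize_render(object_name: str, scene: dict, user_prefs: dict) -> dict:
        """Choose rendering parameters for an object from the observed scene and user preferences."""
        params = {"object": object_name}
        if object_name == "tree":
            # a winter scene yields a bare tree, otherwise a lush green one
            params["foliage"] = "none" if scene.get("season") == "winter" else "lush green"
        color = user_prefs.get("preferred_color")
        if color:
            params["color"] = color     # e.g., render the duck in the user's preferred color
        return params

    # customize_render("tree", {"season": "winter"}, {"preferred_color": "blue"})
    # -> {"object": "tree", "foliage": "none", "color": "blue"}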

In some embodiments, such inputs from the user's site and the processing results thereof may also be transmitted to the user interaction engine 140 to facilitate the user interaction engine 140 in better understanding the specific situation associated with the dialogue, so that the user interaction engine 140 may determine the state of the dialogue and the emotion/mindset of the user, and generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and becoming impatient, the user interaction engine 140 may determine to change the state of the dialogue to a topic that is of interest to the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.

In some embodiments, a client running on the user device may be configured to be able to process raw inputs of different modalities acquired from the user site and send the processed information (e.g., relevant features of the raw inputs) to the agent device or the user interaction engine for further processing. This will reduce the amount of data transmitted over the network and enhance the communication efficiency. Similarly, in some embodiments, the agent device may also be configured to be able to process information from the user device and extract useful information for, e.g., customization purposes. Although the user interaction engine 140 may control the state and flow of the dialogue, keeping the user interaction engine 140 lightweight helps it scale better.

FIG. 2B depicts the same setting as what is presented in FIG. 2A with additional details on the user device 110-a. As shown, during a dialogue between the user and the agent 210, the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner. This may further enhance the user experience or engagement. FIG. 2B illustrates exemplary sensors such as video sensor 230, audio sensor 240, . . . , or haptic sensor 250. The user device may also send textual data as part of the multi-modal sensor data. Together, these sensors provide contextual information surrounding the dialogue and can be used by the user interaction engine 140 to understand the situation in order to manage the dialogue. In some embodiments, the multi-modal sensor data may first be processed on the user device, and important features in different modalities may be extracted and sent to the user interaction engine 140 so that the dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction engine 140 for processing.
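
The following is a hedged sketch of the on-device feature extraction described above, in which only compact features, rather than the raw multi-modal streams, are transmitted to the user interaction engine 140; the particular features computed here are placeholders:

    def summarize_sensor_frame(audio_samples, video_frame, text):
        """Reduce raw sensor data to a small feature payload (illustrative)."""
        features = {}
        if audio_samples:
            # crude loudness estimate from raw audio samples
            features["audio_energy"] = sum(s * s for s in audio_samples) / len(audio_samples)
        if video_frame:
            # crude brightness estimate from a frame given as rows of pixel intensities
            flat = [p for row in video_frame for p in row]
            features["mean_brightness"] = sum(flat) / len(flat) if flat else 0.0
        if text:
            features["text"] = text.strip()
        return features   # small payload sent over the network instead of the raw data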

As seen in FIGS. 2A-2B, the agent device may correspond to a robot that has different parts, including its head 210 and its body 220. Although the agent device as illustrated in FIGS. 2A-2B appears to be a person robot, it may also be constructed in other forms as well, such as a duck, a bear, a rabbit, etc. FIG. 3A illustrates an exemplary structure of an agent device with exemplary types of agent body, in accordance with an embodiment of the present teaching. As presented, an agent device may include a head and a body with the head attached to the body. In some embodiments, the head of an agent device may have additional parts such as a face, a nose and a mouth, some of which may be controlled to, e.g., make movements or expressions. In some embodiments, the face on an agent device may correspond to a display screen on which a face can be rendered, and the face may be of a person or of an animal. Such a displayed face may also be controlled to express emotion.

The body part of an agent device may also correspond to different forms such as a duck, a bear, a rabbit, etc. The body of the agent device may be stationary, movable, or semi-movable. An agent device with a stationary body may correspond to a device that can sit on a surface such as a table to conduct a face to face conversation with a human user sitting next to the table. An agent device with a movable body may correspond to a device that can move around on a surface such as a table surface or floor. Such a movable body may include parts that can be kinematically controlled to make physical moves. For example, an agent body may include feet which can be controlled to move in space when needed. In some embodiments, the body of an agent device may be semi-movable, i.e., some parts are movable and some are not. For example, a tail on the body of an agent device with a duck appearance may be movable but the duck cannot move in space. A bear body agent device may also have arms that may be movable but the bear can only sit on a surface.

FIG. 3B illustrates an exemplary agent device or automated companion 160-a, in accordance with an embodiment of the present teaching. The automated companion 160-a is a device that interacts with people using speech and/or facial expression or physical gestures. For example, the automated companion 160-a corresponds to an animatronic peripheral device with different parts, including head portion 310, eye portion (cameras) 320, a mouth portion with laser 325 and a microphone 330, a speaker 340, neck portion with servos 350, one or more magnets or other components that can be used for contactless detection of presence 360, and a body portion corresponding to, e.g., a charge base 370. In operation, the automated companion 160-a may be connected to a user device which may include a mobile multi-function device (110-a) via network connections. Once connected, the automated companion 160-a and the user device interact with each other via, e.g., speech, motion, gestures, and/or via pointing with a laser pointer.

Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion. The automated companion may use a camera (320) to observe the user's presence, facial expressions, direction of gaze, surroundings, etc. An animatronic embodiment may “look” by pointing its head (310) containing a camera (320), “listen” using its microphone (340), and “point” by directing its head (310) that can move via servos (350). In some embodiments, the head of the agent device may also be controlled remotely by, e.g., the user interaction engine 140 or by a client in a user device (110-a), via a laser (325). The exemplary automated companion 160-a as shown in FIG. 3B may also be controlled to “speak” via a speaker (330).

FIG. 4A depicts an exemplary high level system diagram for an overall system for the automated companion, according to various embodiments of the present teaching. In this illustrated embodiment, the overall system may encompass components/function modules residing in a user device, an agent device, and the user interaction engine 140. The overall system as depicted herein comprises a plurality of layers of processing and hierarchies that together carry out human-machine interactions in an intelligent manner. In the illustrated embodiment, there are 5 layers, including layer 1 for the front end application as well as front end multi-modal data processing, layer 2 for characterizations of the dialog setting, layer 3 where the dialog management module resides, layer 4 for estimating the mindset of different parties (human, agent, device, etc.), and layer 5 for the so called utility. Different layers may correspond to different levels of processing, ranging from raw data acquisition and processing at layer 1 to processing the changing utilities of participants of dialogues at layer 5.

The term “utility” is hereby defined as preferences of a party identified based on states detected in association with dialogue histories. Utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or another intelligent device. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialog walks through in a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences, and such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, are what is referred to as utility. The present teaching discloses a method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user's utility.
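
Purely as an illustration of the definition above, a utility could be kept as an evolving record of choices from which an ordered preference list is derived; the data structure below is an assumption, not the disclosed learning method:

    from collections import Counter


    class Utility:
        """An evolving, ordered hierarchy of preferences learned from choices (illustrative)."""

        def __init__(self):
            self.choice_counts = Counter()      # option -> number of times the party chose it

        def record_choice(self, option: str):
            self.choice_counts[option] += 1     # each choice made during a dialogue nudges the hierarchy

        def preferences(self):
            # ordered sequence of options, most preferred first
            return [option for option, _ in self.choice_counts.most_common()]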

Within the overall system for supporting the automated companion, front end applications as well as front end multi-modal data processing in layer 1 may reside in a user device and/or an agent device. For example, the camera, microphone, keyboard, display, renderer, speakers, chat-bubble, and user interface elements may be components or functional modules of the user device. For instance, there may be an application or client running on the user device which may include the functionalities before an external application interface (API) as shown in FIG. 4A. In some embodiments, the functionalities beyond the external API may be considered as the backend system or may reside in the user interaction engine 140. The application running on the user device may take multi-modal data (audio, images, video, text) from the sensors or circuitry of the user device, process the multi-modal data to generate text or other types of signals (objects such as a detected user face, speech understanding results) representing features of the raw multi-modal data, and send them to layer 2 of the system.

In layer 1, multi-modal data may be acquired via sensors such as a camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimate or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and then used by components of higher layers, via the internal API as shown in FIG. 4A, to, e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., the recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.

The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3 to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represents a human user's preferences. Such preferences may be captured dynamically during the dialogue at utilities (layer 5). As shown in FIG. 4A, utilities at layer 5 represent evolving states that are indicative of parties' evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction.

Sharing of information among different layers may be accomplished via APIs. In some embodiments as illustrated in FIG. 4A, information sharing between layer 1 and the rest of the layers is via an external API, while sharing information among layers 2-5 is via an internal API. It is understood that this is merely a design choice and other implementations are also possible to realize the present teaching presented herein. In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include a common configuration to be applied to a dialogue (e.g., the character of the agent device is an avatar, the voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database that provides parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).
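
By way of example only, the shared information exchanged over the internal API might resemble the following configuration/state record; every key and value here is hypothetical:

    dialogue_config = {
        "agent_character": "duck",                         # character/avatar of the agent device
        "voice": {"style": "cheerful", "accent": "US"},    # preferred voice for rendering speech
        "virtual_environment": "classroom",                # virtual environment created for the dialogue
        "dialogue_state": {"current_node": 3, "topic": "spelling"},
        "dialogue_history": ["How are you doing today?", "Ok"],
        "user_profile": {"preferred_color": "green", "estimated_emotion": "neutral"},
    }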

FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may be faced with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.

If, at node 1, the user responds negatively, the path for this stage is from node 1 to node 10. If the user responds, at node 1, with a “so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user, “No response,” “Positive Response,” and “Negative response,” corresponding to nodes 5, 6, and 7, respectively. Depending on the user's actual response with respect to the automated companion's response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user's reaction to the automated companion's response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in FIG. 4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during the dialogue are represented by the dashed lines.
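
The traversal described above can be pictured with a small sketch; the mapping below reconstructs only the branches of dialogue tree 400 that are spelled out in the text, with classes of user responses as edge labels:

    # node -> {class of user response: next node}
    tree_400 = {
        1: {"affirmative": 2, "negative": 10, "so-so": 3},
        3: {"no response": 5, "positive": 6, "negative": 7},
        6: {"correct answer": 8},
    }

    def walk(tree, start, user_replies):
        """Follow the path taken through the dialogue tree for a sequence of user replies."""
        path = [start]
        node = start
        for reply in user_replies:
            node = tree.get(node, {}).get(reply)
            if node is None:
                break
            path.append(node)
        return path

    # walk(tree_400, 1, ["so-so", "positive", "correct answer"]) returns [1, 3, 6, 8],
    # the solid-line path of FIG. 4B.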

FIG. 4C illustrates an exemplary human-agent device interaction and exemplary processing performed by the automated companion, according to an embodiment of the present teaching. As seen from FIG. 4C, operations at different layers may be conducted and together they facilitate intelligent dialogue in a cooperative manner. In the illustrated example, an agent device may first ask a user “How are you doing today?” at 402 to initiate a conversation. In response to the utterance at 402, the user may respond with the utterance “Ok” at 404. To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observations of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.

Based on the acquired multi-modal data, analysis may be performed by the automated companion (e.g., by the front end user device or by the backend user interaction engine 140) to assess the attitude, emotion, mindset, and utility of the user. For example, based on visual data analysis, the automated companion may detect that the user appears sad, is not smiling, and the user's speech is slow with a low voice. The characterization of the user's states in the dialogue may be performed at layer 2 based on multi-modal data acquired at layer 1. Based on such detected observations, the automated companion may infer (at 406) that the user is not that interested in the current topic and not that engaged. Such an inference of the emotion or mental state of the user may, for instance, be performed at layer 4 based on characterization of the multi-modal data associated with the user.

To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-modal data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.

Based on the acquired new information and the inference based on that, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user while still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversation to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user while still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting that the user play a spelling game (at 414) and asking the user to spell the word “basketball.”

Given the adaptive dialogue strategy of the automated companion in light of the observations of the user and the environment, the user may respond by providing the spelling of the word “basketball” (at 416). Observations are continuously made as to how enthusiastic the user is in answering the spelling question. If the user appears to respond quickly with a brighter attitude, determined based on, e.g., multi-modal data acquired when the user is answering the spelling question, the automated companion may infer, at 418, that the user is now more engaged. To further encourage the user to actively participate in the dialogue, the automated companion may then generate a positive response “Great job!” with an instruction to deliver this response in a bright, encouraging, and positive voice to the user.

FIG. 5 illustrates exemplary communications among different processing layers of an automated dialogue companion centered around a dialogue manager 510, according to various embodiments of the present teaching. The dialogue manager 510 in FIG. 5 corresponds to a functional component of the dialogue management at layer 3. A dialog manager is an important part of the automated companion and it manages dialogues. Traditionally, a dialogue manager takes in as input a user's utterances and determines how to respond to the user. This is performed without taking into account the user's preferences, the user's mindset/emotions/intent, or the surrounding environment of the dialogue, i.e., without giving any weight to the different available states of the relevant world. The lack of an understanding of the surrounding world often limits the perceived authenticity of, or engagement in, the conversations between a human user and an intelligent agent.

In some embodiments of the present teaching, the utility of parties of a conversation relevant to an on-going dialogue is exploited to allow a more personalized, flexible, and engaging conversation to be carried out. It facilitates an intelligent agent acting in different roles to become more effective in different tasks, e.g., scheduling appointments, booking travel, ordering equipment and supplies, and researching online on various topics. When an intelligent agent is aware of a user's dynamic mindset, emotions, intent, and/or utility, it enables the agent to engage a human conversant in the dialogue in a more targeted and effective way. For example, when an education agent teaches a child, the preferences of the child (e.g., the color he loves), the emotion observed (e.g., sometimes the child does not feel like continuing the lesson), and the intent (e.g., the child is reaching out to a ball on the floor instead of focusing on the lesson) may all permit the education agent to flexibly adjust the focus subject to toys, and possibly the manner by which to continue the conversation with the child, so that the child may be given a break in order to achieve the overall goal of educating the child.

As another example, the present teaching may be used to enhance a customer service agent in its service by asking questions that are more appropriate given what is observed in real-time from the user, and hence achieving improved user experience. This is rooted in the essential aspects of the present teaching as disclosed herein by developing the means and methods to learn and adapt preferences or mindsets of parties participating in a dialogue so that the dialogue can be conducted in a more engaging manner.

Dialogue manager (DM) 510 is a core component of the automated companion. As shown in FIG. 5, DM 510 (layer 3) takes input from different layers, including input from layer 2 as well as input from higher levels of abstraction such as layer 4 for estimating mindsets of parties involved in a dialogue and layer 5 that learns utilities/preferences based on dialogues and assessed performances thereof. As illustrated, at layer 1, multi-modal information is acquired from sensors in different modalities, which is processed to, e.g., obtain features that characterize the data. This may include signal processing in visual, acoustic, and textual modalities.

Such multi-modal information may be acquired by sensors deployed on a user device, e.g., 110-a, during the dialogue. The acquired multi-modal information may be related to the user operating the user device 110-a and/or the surrounding of the dialogue scene. In some embodiments, the multi-modal information may also be acquired by an agent device, e.g., 160-a, during the dialogue. In some embodiments, sensors on both the user device and the agent device may acquire relevant information. In some embodiments, the acquired multi-modal information is processed at Layer 1, as shown in FIG. 5, which may include both a user device and an agent device. Depending on the situation and configuration, Layer 1 processing on each device may differ. For instance, if a user device 110-a is used to acquire surrounding information of a dialogue, including both information about the user and the environment around the user, raw input data (e.g., text, visual, or audio) may be processed on the user device and then the processed features may be sent to Layer 2 for further analysis (at a higher level of abstraction). If some of the multi-modal information about the user and the dialogue environment is acquired by an agent device, the processing of such acquired raw data may also be performed by the agent device (not shown in FIG. 5) and then features extracted from such raw data may be sent from the agent device to Layer 2 (which may be located in the user interaction engine 140).

Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a, and examples of such rendering include speech and expression, which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) as speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc., which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., uttering a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).
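
A hedged sketch of this rendering step is shown below; synthesize_speech and play_audio stand in for whatever text-to-speech and audio-output components the agent device actually provides, and the instruction keys are assumptions:

    def render_response(text: str, instructions: dict, synthesize_speech, play_audio):
        """Turn a response text plus rendering instructions into an uttered response (illustrative)."""
        waveform = synthesize_speech(
            text,
            volume=instructions.get("volume", 1.0),
            tone=instructions.get("tone", "neutral"),
            pitch=instructions.get("pitch", 1.0),
        )
        play_audio(waveform)    # the agent device utters the response to the user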

Processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. Such understanding may be physical (e.g., recognizing certain objects in the scene), perceivable (e.g., recognizing what the user said, or certain significant sound, etc.), or mental (e.g., a certain emotion such as stress of the user estimated based on, e.g., the tone of the speech, a facial expression, or a gesture of the user).

The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.

In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., it may inject small talk to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as the state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).

An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in a certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., a noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.
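
For illustration, the delivery parameters output alongside a response might be chosen along the following lines; the thresholds, parameter names, and mappings are assumptions rather than the disclosed design:

    def delivery_parameters(emotion: str, user_prefs: dict, environment: dict) -> dict:
        """Derive delivery parameters from the user's emotion, utility, and environment (illustrative)."""
        params = {
            "volume": 1.0,
            "tone": "neutral",
            "accent": user_prefs.get("accent", "default"),   # e.g., an accent similar to the parents'
        }
        if emotion in ("sad", "upset"):
            params["tone"] = "gentle"                        # render the response in a gentle voice
        if environment.get("noise_level", 0.0) > 0.7:
            params["volume"] = 1.5                           # speak louder in a noisy place
        return params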

In some embodiments, the delivery of such a determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other deliverable forms of a response that are acoustic but not verbal, e.g., a whistle.

To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actually render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text-to-speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response, or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated, e.g., via animation, into control signals that can be used to control certain parts of the agent device (the physical representation of the automated companion) to perform certain mechanical movements that deliver the non-verbal expression of the response, e.g., nodding the head, shrugging the shoulders, or whistling. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression on the agent device. Such renditions of the response may also be carried out simultaneously by the agent (e.g., speaking a response in a joking voice with a big smile on the agent's face).

FIG. 6 depicts an exemplary high level system diagram for an artificial intelligence based educational companion, according to various embodiments of the present teaching. In this illustrated embodiment, there are five levels of processing, namely the device level, the processing level, the reasoning level, the pedagogy or teaching level, and the educator level. The device level comprises sensors such as a microphone and camera, or media delivery devices such as servos to move, e.g., body parts of a robot, or speakers to deliver dialogue content. The processing level comprises various processing components directed to processing of different types of signals, which include both input and output signals.

On the input side, the processing level may include a speech processing module for performing, e.g., speech recognition based on an audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied by a smiling face and certain acoustic cues. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.
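
The following sketch (in Python) is one illustrative, assumed way of fusing per-modality emotion scores from audio, video, and recognized text into a single estimate; the weights and emotion labels are stand-ins, not values from the present disclosure.

# A minimal sketch of multi-modal emotion fusion; weights and labels are assumed.
from collections import defaultdict

MODALITY_WEIGHTS = {"audio": 0.4, "visual": 0.4, "text": 0.2}

def fuse_emotion_scores(scores_by_modality):
    """scores_by_modality maps a modality name to {emotion: score}."""
    fused = defaultdict(float)
    for modality, scores in scores_by_modality.items():
        weight = MODALITY_WEIGHTS.get(modality, 0.0)
        for emotion, score in scores.items():
            fused[emotion] += weight * score
    return max(fused, key=fused.get) if fused else "neutral"

# Example: a smiling face plus upbeat acoustic cues yields "happy".
estimate = fuse_emotion_scores({
    "audio": {"happy": 0.8, "stressed": 0.2},
    "visual": {"happy": 0.9, "stressed": 0.1},
    "text": {"happy": 0.5, "stressed": 0.5},
})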

On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions for the automated companion to take in order to respond to the other party. Such actions may be carried out by either delivering some audio response or expressing a certain emotion or attitude via a certain gesture. When the response is to be delivered in audio, the text with the words to be spoken is processed by a text-to-speech module to produce audio signals, and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be produced in accordance with other parameters, e.g., parameters that may be used to control the generation of the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also correspond to instructions to be used to generate such body movement. For example, the processing level may include a module for moving the head (e.g., nodding, shaking, or other movement of the head) of the automated companion in accordance with some instruction (symbol). To follow the instruction to move the head, the module for moving the head may generate an electrical signal, based on the instruction, and send it to servos to physically control the head movement.

The third level is the reasoning level, which is used to perform high level reasoning based on analyzed sensor data. Text from speech recognition, or estimated emotion (or other characterizations), may be sent to an inference program which may operate to infer various high level concepts such as intent, mindset, and preferences based on information received from the second level. The inferred high level concepts may then be used by a utility based planning module that devises a plan to respond in a dialogue given the teaching plans defined at the pedagogy level and the current state of the user. The planned response may then be translated into an action to be performed to deliver the planned response. The action is then further processed by an action generator to specifically direct different media platforms to carry out the intelligent response.

The pedagogy and educator levels both relate to the disclosed educational application. The educator level includes activities related to designing curriculums for different subject matters. Based on a designed curriculum, the pedagogy level includes a curriculum scheduler that schedules courses based on the designed curriculum and, based on the curriculum schedule, the problem settings module may arrange certain problem settings to be offered according to the specific curriculum schedule. Such problem settings may be used by the modules at the reasoning level to assist in inferring the reactions of the users and then planning the responses accordingly based on utility and the inferred state of mind.

In a user machine dialogue system, the dialogue manager, such as 510 in FIG. 5, plays a central role. It receives input from a user device or an agent device with observations (of the user's utterance, facial expression, surroundings, etc.) and determines a response that is appropriate given the current state of the dialogue and the objective(s) of the dialogue. For instance, if an objective of a particular dialogue is to teach the concept of triangulation to the user, a response devised by the dialogue manager 510 is determined not only based on a previous communication from the user but also on the objective of ensuring that the user learns the concept. Traditionally, a dialogue system drives communication with a human user by exploring a dialogue tree associated with the intended purpose of the dialogue and a current state of the conversation. This is illustrated in FIG. 7, where a user 700 interfaces with a device 710 to carry out a conversation. During the conversation, the user utters some speech that is sent to the device 710 and, based on the utterance of the user, the device sends a request to a server 720, which then provides a response (obtained based on a dialogue tree 750) to the device, which then renders the response to the user. Due to limited computation power and memory on the device, most of the computation needed to generate a response to the user is performed at the server 720.

In operation, from the perspective of the device 710, it acquires an utterance from the user 700 related to the dialogue, transmits a request with the acquired user's information to the server 720, subsequently receives a response determined by the server 720, and renders the response on the device 710 to the user 700. On the server side, it comprises a controller 730, which may be deployed to interface with the device 710, and a dialogue manager 740 that drives the dialogue with a user based on an appropriate dialogue tree 750. The dialogue tree 750 may be selected from a plurality of dialogue trees based on the current dialogue. For instance, if a current dialogue is for booking a flight, the dialogue tree selected for the dialogue manager 740 to drive the conversation may be specifically constructed for that intended purpose.

When the user's information is received, the controller 730 may analyze the received user's information, such as what the user uttered, to derive a current state of the dialogue. It may then invoke the dialogue manager 740 to search the dialogue tree 750 based on the current state of the dialogue to identify an appropriate response to the user. Such an identified response is then sent from the dialogue manager 740 to the controller 730, which may then forward it to the device 710. Such a dialogue process requires back and forth communication traffic between the device 710 and the server 720, costing time and bandwidth. In addition, in most situations the server 720 may be the backbone supporting multiple user devices and/or agent devices (if they are separate from the user devices). Furthermore, each of the user devices may be in a different dialogue that needs to be driven using a different dialogue tree. Given that, when a high number of devices rely on the server 720 to drive their respective dialogues, and because traditionally the server 720 needs to make decisions for all user devices/agent devices, the constant processing of information from different dialogues and searching of different dialogue trees to derive responses for different dialogues may become time consuming, affecting the server's ability to scale up.
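
For illustration only, the sketch below (Python, with an assumed dictionary standing in for dialogue tree 750 and assumed function names) shows the traditional, fully server-driven flow of FIG. 7, in which every utterance triggers a round trip to the server that searches the full dialogue tree.

# A sketch of the traditional server-driven flow; data and names are assumed.
FULL_DIALOGUE_TREE = {
    # normalized utterance -> response; a stand-in for dialogue tree 750
    "i want to book a flight": "Where would you like to fly to?",
    "new york": "What date would you like to depart?",
}

def server_handle_request(utterance: str) -> str:
    key = utterance.strip().lower()
    # The server-side dialogue manager searches the tree for each request.
    return FULL_DIALOGUE_TREE.get(key, "Could you please rephrase that?")

def device_converse(utterance: str) -> str:
    # The device sends a request for every utterance and renders the reply,
    # incurring a server round trip each time.
    return server_handle_request(utterance)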

The present teaching discloses an alternative configuration that enables a distributed way of conducting a user machine dialogue by intelligently caching relevant segments of a full dialogue tree 750 on devices (either a user device or an agent device). The "relevancy" here may be defined dynamically based on the respective temporal and spatial locality related to each dialogue at different time frames. To facilitate the utilization of a local dialogue tree cached on a device, the cached dialogue tree may be provided in conjunction with a local version of a dialogue manager with an appropriate set of functions enabling the local dialogue manager to operate on the cached dialogue tree. With respect to each local dialogue tree to be cached on a device, a sub-set of the functions associated with the parent dialogue tree (the overall dialogue tree from which the local dialogue tree is carved out) may be determined and provided dynamically, for example, the functions that enable the local dialogue manager to parse the cached local dialogue tree and to traverse it. In some embodiments, the local dialogue manager to be deployed on a device may be optimized based on different criteria, e.g., the local device type, the specific local dialogue tree, the nature of the dialogue, the observations made from the dialogue scene, and/or certain user preferences.
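
By way of illustration only, the sketch below (Python) shows one possible way a local sub-dialogue tree could be "carved out" of the overall tree around the current dialogue state; the node layout and depth limit are assumptions introduced for illustration.

# A minimal sketch of carving a local sub-tree from the overall dialogue tree.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogueNode:
    utterance: str                      # expected user utterance at this node
    response: str                       # response the agent would give
    children: Dict[str, "DialogueNode"] = field(default_factory=dict)

def carve_local_subtree(node: DialogueNode, depth: int) -> DialogueNode:
    """Copy the portion of the tree rooted at `node`, limited to `depth` levels,
    i.e., the segment most likely to be needed in the near future."""
    copied = DialogueNode(node.utterance, node.response)
    if depth > 0:
        copied.children = {
            key: carve_local_subtree(child, depth - 1)
            for key, child in node.children.items()
        }
    return copied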

FIG. 8 depicts an exemplary framework directed to distributed dialogue management, according to an embodiment of the present teaching. As shown, the framework includes a device 810 interfacing with a user 800 and a server 840, and the device and the server together drive a dialogue with the user 800 in a distributed manner. Depending on the actual dialogue configuration, the device 810 may be a user device, e.g., 110-a, operated by the user, or an agent device, e.g., 160-a, that is part of an automated dialogue companion, or a combination thereof. The device is used to interface with the user 800, or with a user device 110-a, to carry on a dialogue with the user. The device and the server together constitute an automated dialogue companion and manage the dialogue in an efficient and effective manner. In some embodiments, the server is connected to a plurality of devices and serves as a backend of these devices to drive different dialogues with different users on different topics.

The device 810 includes, in addition to other components, a local dialogue manager 820, devised for the device with respect to the current state of the dialogue, and a local dialogue tree 830, which is a portion of the overall dialogue tree 750 and carved out for the device based on the progression and the current state of the dialogue. In some embodiments, such a local dialogue tree 830 cached on the device 810 is determined and deployed based on an assessment that this portion of the dialogue tree is likely to be needed in the near future by the device 810 to drive the dialogue with user 800 given the current state of the dialogue and/or known preferences of the user.

With the local version of the dialogue manager and the dialogue tree deployed on the device 810, whenever feasible, the dialogue is managed by the local dialogue manager based on the cached local dialogue tree 830. It is in this manner that the traffic and bandwidth consumption caused by the frequent communication between the device 810 and the server 840 is reduced. In operation, if the content of the utterance of the user 800 is within the cached dialogue tree 830, as determined by the local dialogue manager 820, the device 810 then provides the response from the cached dialogue tree 830 to the user without having to communicate with the server. Thus, the speed of responding to the user 800 may also improve.

If there is a cache miss, i.e., given the user's input, the local dialogue manager 820 does not find the response in the cached dialogue tree 830, the device 810 sends a request to the server 840 with information related to the current dialogue state, and subsequently receives a response identified by the dialogue manager 860 in the server 840 based on the full dialogue tree 750. Because there is a miss, along with the response from the server 840, the device 810 also receives an updated local dialogue tree (DT) and an updated local dialogue manager (DM) from the server, so that the previous local version of the DT and DM may be replaced with an updated version that is generated adaptively based on the progression of the dialogue.
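
The following sketch (Python) illustrates one possible shape of this device-side hit/miss handling, building on the DialogueNode structure sketched above; the helper functions request_response_from_server and render are illustrative placeholders, not elements of the present disclosure.

# A sketch of device-side cache hit/miss handling; helper callables are assumed.
def respond_locally_or_escalate(local_tree, utterance, request_response_from_server, render):
    """local_tree is a DialogueNode as sketched earlier; returns the new local tree/node."""
    key = utterance.strip().lower()
    child = local_tree.children.get(key)
    if child is not None:
        # Cache hit: answer directly from the cached sub-tree, no server round trip.
        render(child.response)
        return child
    # Cache miss: ask the server, which returns the response plus a refreshed
    # local dialogue tree (and, in some embodiments, a refreshed local DM).
    response, updated_local_tree = request_response_from_server(utterance)
    render(response)
    return updated_local_tree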

In this illustrated embodiment, the server 840 comprises a controller 850, a dialogue manager 860, and a local DM/DT generator 870 (local DM refers to the local dialogue manager 820 and local DT refers to the local dialogue tree 830). The functional role of the dialogue manager 860 is the same as in the traditional system, to determine a response based on an input from the user in accordance with a dialogue tree 750 selected to drive the dialogue. In operation, upon receiving a request from the device 810 for a response (with the user's information), the controller 850 invokes not only the dialogue manager 860 to generate the requested response but also the local DM/DT generator 870 to generate, for the requesting device 810, the updated local dialogue tree 830 (DT) and the local dialogue manager 820 (DM) with respect to the dialogue tree 750 and a current dialogue state, estimated by the dialogue manager 860 based on the received user's information. Such generated local DT/DM are then sent to the device 810 to update the previous version cached therein.

FIG. 9 depicts an exemplary high level system diagram of the device 810, according to an embodiment of the present teaching. As discussed herein, the device 810 may be a user device, an agent device, or a combination thereof. FIG. 9 shows the relevant functional components used to implement the present teaching; each of such components may reside on either a user device or an agent device, and they work together in a coordinated manner to achieve the aspects of the functions related to the device 810 of the present teaching. In the illustrated embodiment, the device 810 comprises a sensor data analyzer 910, a surrounding information understanding unit 920, the local dialogue manager 820, a device/server coordinator 930, a response rendering unit 940, a local dialogue manager updater 950, and a local dialogue tree updater 960. FIG. 10 is a flowchart of an exemplary process of the device 810, according to an embodiment of the present teaching. In operation, the sensor data analyzer 910 receives, at 1005 of FIG. 10, sensor data from the user 800. Such received sensor data may be multi-modal, including, e.g., acoustic data representing the speech of the user and/or visual data corresponding to a visual representation of the user (e.g., facial expression) and/or the surrounding of the dialogue scene.

Upon receiving the sensor data, the sensor data analyzer 910 analyzes, at 1010, the received data, extracts relevant features from the sensor data, and sends them to the surrounding information understanding unit 920. For example, based on acoustic features extracted from audio data, the surrounding information understanding unit 920 may determine the text corresponding to the utterance from the user 800. In some embodiments, features extracted from visual data may also be used to understand what is happening in the dialogue. For instance, lip movement of the user 800 may be tracked and features of the lip shape may be extracted and used to understand, in addition to the audio data, the text of the speech that the user 800 uttered. The surrounding information understanding unit 920 may also analyze the features of the sensor data to achieve an understanding of other aspects of the dialogue. For instance, the tone of the speech from the user, the facial expression of the user, objects in the dialogue scene, etc., may also be identified and used by the local dialogue manager 820 to determine a response.

In deriving an understanding of the current state of the dialogue (e.g., what the user said, or in what manner), the surrounding information understanding unit 920 may rely on various models or sensor data understanding models 925, which may include, e.g., acoustic models for recognizing the sounds in the dialogue scene, natural language understanding (NLU) models for recognizing what was uttered, object detection models for detecting, e.g., the user's face and other objects in the scene (trees, desk, chair, etc.), emotion detection models for detecting facial expressions or for detecting tones in speech associated with different emotional states of a person, etc. Such an understanding of the current state of the dialogue may then be sent from the surrounding information understanding unit 920 to the local dialogue manager 820 to enable it to determine a response to the user based on the local dialogue tree 830.

Upon receiving the current dialogue state, the local dialogue manager (DM) 820 is invoked to search, at 1015, for a response in the local dialogue tree (DT) 830. As discussed herein, a current dialogue state may include one or more types of information such as a current utterance of the user, the estimated user emotion/intent, and/or the surrounding information of the dialogue scene. A response to the current user's utterance is generally generated based on the content of the utterance as well as a dialogue tree, such as dialogue tree 750, that is used to drive the dialogue. According to the present teaching, the local DM 820, once invoked, searches the local DT 830 to see if the local DT 830 can be used to identify an appropriate response. The search is based on the content of the current utterance. The intended purpose of deploying the local DM 820 and the local DT 830 is that, in most situations, a response can be found locally, saving the time and traffic needed to communicate with the server 840 to identify a response. If this is the case, as determined at 1020, the content of the current utterance from the user falls on a non-leaf node within the local DT 830, and the response is one of the branches from that non-leaf node. That is, the local DM 820 generates, at 1025, a response based on the search of the local DT 830 and such a generated response is then rendered, by the response rendering unit 940 at 1030, to the user.

In some situations, a response cannot be found in the local DT 830. When that occurs, a response needs to be generated by the server 840 in accordance with the overall dialogue tree 750. There may be different scenarios in which a response cannot be found by the local DM 820 based on the local DT 830. For example, the content of the current utterance from the user may not be found in the local DT 830. In this case, the response to a non-recognized utterance from the user is to be determined by the server 840. In a different situation, the current utterance is found in the local DT 830, yet the response thereto is not stored locally (e.g., the current dialogue state corresponds to a leaf node of the local DT 830). In this case, a response is also not available locally. In both scenarios, the local dialogue tree cached in 830 cannot be used to drive the dialogue further, and the local DM 820 then invokes the device/server coordinator 930 to send, at 1035, a request to the server 840 for a response, with information relevant to the dialogue state to facilitate the server in identifying an appropriate response. The device/server coordinator 930 subsequently receives, at 1040 and 1045, respectively, the response sought and the renewed local DM and local DT. Upon receiving the updated local DM and local DT, the device/server coordinator 930 then invokes the local dialogue manager updater 950 and the local dialogue tree updater 960 to update, at 1050, the local DM 820 and the local DT 830. The device/server coordinator 930 also sends, at 1055, the received response to the response rendering unit 940 so that the response may be rendered at 1030 to the user.
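
The standalone sketch below (Python) distinguishes the two miss scenarios just described; the local sub-tree is modeled, for illustration only, as nested dictionaries in which a node at the boundary of the carved-out sub-tree may carry no locally stored response.

# A sketch classifying a local search as a hit or one of the two miss scenarios.
def classify_local_search(local_subtree: dict, utterance: str) -> str:
    """local_subtree maps a normalized utterance to {"response": ..., "next": {...}}."""
    key = utterance.strip().lower()
    node = local_subtree.get(key)
    if node is None:
        return "miss_unrecognized"   # utterance not found in the local DT 830
    if node.get("response") is None:
        return "miss_leaf"           # utterance found, but its response is not cached
    return "hit"                     # respond locally, no server round trip needed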

FIG. 11 depicts an exemplary system diagram of the server 840, according to an embodiment of the present teaching. In this illustrated embodiment, the server 840 comprises a device interface unit 1110, a current local DM/DT information retriever 1120, a current user state analyzer 1140, the dialogue manager 860, an updated local DT determiner 1160, an updated local DM determiner 1150, and a local DM/DT generator 870. FIG. 12 is a flowchart of an exemplary process of the server 840, according to an embodiment of the present teaching. In operation, when the device interface unit 1110 receives, at 1210 of FIG. 12, a request from a device seeking a response, with information relevant to the current state of the dialogue, it invokes the current user state analyzer 1140 to analyze, at 1220, the received relevant information to understand the user's input. To identify a response to the user's input, the dialogue manager 860 is invoked to search, at 1230, the full dialogue tree 750 to obtain a response.

As discussed herein, when the server 840 is requested to provide a response to a dialogue at a device, it indicates that the local DM 820 and the local DT 830 previously deployed on that device no longer work (they already led to a miss) for that local dialogue. As such, in addition to providing a response for the device, the server 840 also generates an updated local DM and local DT to be cached at the device. In some embodiments, to achieve that, the device interface unit 1110 also invokes the current local DM/DT Info retriever 1120 to retrieve, at 1240, information related to the local DM/DT previously deployed on the device.

Such retrieved information about the previously deployed local DM and local DT, together with the currently server-generated response and the current state of the dialogue, is sent to the updated local DT determiner 1160 and the updated local DM determiner 1150 to determine, at 1250, an updated local DT and an updated local DM with respect to the current response and the current dialogue state. Such determined updated local DM/DT are then sent to the local DM/DT generator 870, which then generates, at 1260, the updated local DM/DT to be sent to the device. The generated updated local DM/DT are then archived in the local DT/DM dispatch archive 1130 and then sent to the device by the device interface unit 1110. In this manner, whenever there is a miss, the server 840 updates the local DM/DT on the device so that the communication traffic and the bandwidth required for the server to support the device may be reduced and, hence, the speed of responding to users in human machine dialogues may be enhanced.

Traditionally, a dialogue management system such as the dialogue manager 840 takes in text (e.g., generated based on speech understanding) and outputs text based on a search of a dialogue tree. In a sense, a dialogue tree corresponds to a decision tree. At each step of a dialogue driven based on such a decision tree, there may be a node representing a current utterance and multiple choices branching from the node representing all possible answers connected to the node. Thus, from each node, a possible response may lie along any of the multiple paths. In this sense, the process of a dialogue traverses a dialogue tree and forms a dialogue path, as shown in FIG. 4B. The job of the dialogue manager is to determine a choice at each node (representing an utterance of the user) by, e.g., optimizing some gain in terms of the underlying dialogue. The determination of the selected path may take time and is based on information from different sources and different aspects of the understanding of the scenario surrounding the dialogue.
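
The sketch below (Python) illustrates the branch selection just described, i.e., choosing, at the node matching the current utterance, the outgoing response that maximizes some gain for the dialogue; the gain function and example candidates are assumptions for illustration.

# A sketch of gain-based branch selection at a dialogue tree node.
from typing import Callable, Dict

def select_response(branches: Dict[str, float],
                    gain: Callable[[str, float], float]) -> str:
    """branches maps a candidate response to a prior score; gain scores a
    candidate given its prior and whatever dialogue objectives the caller encodes."""
    return max(branches, key=lambda resp: gain(resp, branches[resp]))

# Example: prefer responses that keep an educational objective on track.
candidates = {"Let's review triangles first.": 0.6, "Want to hear a joke?": 0.4}
choice = select_response(candidates, gain=lambda resp, prior: prior)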

In addition, due to the limited computation power and memory of a device, much of the computation work to generate a response to the user is performed at a server, e.g., 720 in FIG. 7. For example, when the user's information is received, the server 720 may analyze the user's information to understand what is said. The dialogue manager 740 residing on the server then searches the dialogue tree 750 to identify an appropriate response. As discussed herein, this dialogue process heavily relies on the backend server and requires back and forth communication traffic between the device 710 and the server 720. This costs time and bandwidth, affecting the server's ability to scale in conducting concurrent real-time dialogues with a plurality of users.

The present teaching further discloses an approach that enables further reduction of the response time in human machine dialogues by predicting which path(s) in the dialogue tree 750 the user is likely to take in the near future and preemptively generating predicted responses along the predicted path. The prediction of the path for each user may be based on models that characterize, e.g., preferences of the user, and that are created via machine learning based on, e.g., past dialogue histories and/or common knowledge. Such training may be personalized at different levels of granularity. For instance, the models learned for predicting dialogue paths may be individualized based on past data collected with respect to individuals. The models for such dialogue path prediction may also be trained at a group level based on relevant training data, e.g., to train a model for a group of users who share similar characteristics, training data for similar users may be used. Such training and prediction may be performed offline and the trained result may then be applied in online operations to reduce both the response time and the computational burden of the dialogue manager, so that the server may scale better to handle a high volume of requests.
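
For illustration only, the following sketch (Python) shows one simple, assumed way a learned user-preference model might be applied to predict the dialogue path a user is likely to take, so that responses along that path can be generated ahead of time; the scoring model is a stand-in, not the disclosed learning method.

# A sketch of preference-driven dialogue path prediction; the model is assumed.
from typing import Dict, List

def predict_path(tree: dict, user_preferences: Dict[str, float],
                 depth: int) -> List[str]:
    """tree: nested dict of utterance -> {"response": str, "next": dict};
    returns the sequence of utterances judged most likely for this user."""
    path = []
    node = tree
    for _ in range(depth):
        if not node:
            break
        # Score each possible next utterance by the user's learned preference.
        best = max(node, key=lambda u: user_preferences.get(u, 0.0))
        path.append(best)
        node = node[best].get("next", {})
    return path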

By predicting dialogue path(s) and generating likely responses preemptively, when a response is among the preemptively generated responses, the pre-generated response may then be directly provided to the user without having to invoke the dialogue manager to search the dialogue tree, e.g., 750. If a response is not among those preemptively generated, a request may then be made to the dialogue manager to search the dialogue tree to come up with the response. FIG. 13 depicts an exemplary embodiment of the framework of using a preemptively predicted dialogue path (a sub-part of the overall dialogue tree 750) and dialogue content (responses), according to an embodiment of the present teaching. In this illustrated embodiment, a user 1300 communicates via a device 1310, which may be constructed similarly to that in FIG. 7 and communicates with a server 1320 that utilizes preemptively predicted dialogue paths and responses to improve the latency in responding to a human user. In this illustrated embodiment, the server 1320 comprises a controller 1330, a dialogue manager 1340, and a predicted path/response generator 1350 that generates both predicted dialogue paths 1360 and, accordingly, the preemptively generated responses 1370. In operation, when the device 1310 receives user information (utterance, video, etc.), to determine a response, the device 1310 sends a request to the server 1320 seeking a response, with information related to the dialogue state such as the utterance and/or observations of the surrounding of the dialogue, e.g., the user's attitude, emotion, intent, and the objects and characterizations thereof in the dialogue scene. If the requested response is in the predicted paths 1360, a corresponding preemptively generated response is then retrieved directly from the predicted responses 1370 and sent to the device 1310. In this manner, the dialogue manager 1340 is not invoked to process the request and to search the dialogue tree 750 to obtain the response, so that the latency of providing the response is improved.

FIG. 14 depicts an exemplary high level system diagram of the server 1320, according to an embodiment of the present teaching. In the illustrated embodiment, the server 1320 comprises a dialogue state analyzer 1410, a response source determiner 1420, the dialogue manager 1340, a predicted response retriever 1430, a response transmitter 1440, a predicted path generator 1460, and a predicted response generator 1450. FIG. 15 is a flowchart of the exemplary process of the server 1320, according to an embodiment of the present teaching. In operation, the dialogue state analyzer 1410 receives, at 1505 of FIG. 15, a request with information related to the state of the underlying dialogue, including, e.g., acoustic data representing the speech of the user or the analyzed speech of the user and optionally other information related to the dialogue state. Such received information is analyzed at 1510. To determine whether a response appropriate for responding to the user's utterance has been preemptively generated previously, the response source determiner 1420 is invoked to determine, at 1515, whether a predicted path relevant to the user's current utterance exists based on what is stored in the predicted paths 1360. If a predicted path 1360 relevant to the user's current utterance exists, it is further checked, at 1520, whether a desired response for the current utterance with respect to the predicted path exists in the predicted path, i.e., whether the desired response for the current utterance has been preemptively generated. If the desired response has been previously generated, the response source determiner 1420 invokes the predicted response retriever 1430 to retrieve, at 1525, the preemptively generated response from the predicted responses 1370 and then invokes the response transmitter 1440 to send, at 1530, the preemptively generated response to the device 1310.

If either the predicted path relevant to the utterance does not exist, as determined at 1515, or the desired response has not been preemptively generated (in a predicted path), as determined at 1520, the process proceeds to invoke the dialogue manager 1340, at 1535, to generate a response with respect to the current user's utterance. This involves searching the dialogue tree 750 to identify the response. In the event of a miss (i.e., a predicted path does not exist or an existing predicted path does not include the response), the dialogue manager 1340 may also activate the predicted path generator 1460 to predict a path given the current utterance/response identified. Upon being activated, to generate a predicted path, the predicted path generator 1460 may analyze, at 1540, the currently generated response and optionally also a profile for the user currently involved in the dialogue, which is retrieved from a user profile storage 1470. Based on such information, the predicted path generator 1460 predicts, at 1545, a path based on the current utterance/response, the dialogue tree 750, and optionally the user profile. Based on the predicted path, the predicted response generator 1450 generates, at 1550, predicted responses associated with the newly predicted path, i.e., preemptively generating responses. Such a predicted new path and its preemptively generated predicted responses are then stored, at 1555, by the predicted path generator 1460 and the predicted response generator 1450 in the predicted path storage 1360 and the predicted responses storage 1370, respectively. The response so identified is then returned, at 1530, to the device to respond to the user.
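
The sketch below (Python) illustrates the server-side decision just described; the helper callables search_full_tree, predict_path, and generate_responses_along are assumed placeholders standing in for the dialogue manager 1340 and the generators 1460/1450.

# A sketch of the server-side flow: use a preemptive response if one exists,
# otherwise search the full tree and regenerate predictions. Helpers are assumed.
def serve_response(utterance, predicted_responses: dict,
                   search_full_tree, predict_path, generate_responses_along,
                   user_profile=None):
    key = utterance.strip().lower()
    if key in predicted_responses:
        # Preemptively generated response exists: return it without invoking
        # the dialogue manager to search the full dialogue tree.
        return predicted_responses[key]
    # Miss: fall back to searching the overall dialogue tree for the response.
    response = search_full_tree(utterance)
    # Predict the likely continuation and preemptively generate responses along it.
    new_path = predict_path(utterance, response, user_profile)
    predicted_responses.update(generate_responses_along(new_path))
    return response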

FIG. 16 depicts a different exemplary configuration between a device 1610 and a server 1650 in managing a dialogue with a user, according to embodiments of the present teaching. Compared with the embodiment shown in FIG. 13, to further enhance performance and reduce latency and traffic, the configuration in FIG. 16 also deploys, on the device 1610, a local dialogue manager 1620 with a corresponding local predicted path 1640 as well as corresponding preemptively generated local predicted responses 1630. The local dialogue manager 1620 operates based on the local predicted path 1640 and the local predicted responses 1630 to drive the dialogue as far as it can. When there is a miss, the device sends a request with information related to the dialogue state to the server 1650 to seek a response. As shown, the server 1650 also stores a server version of the predicted path 1360 and a server version of the predicted responses 1370. In some embodiments, the server predicted path 1360 and the server predicted responses 1370 stored on the server 1650 may not be the same as the local versions 1640 and 1630. For instance, the server predicted path 1360 may be more extensive than the local predicted path 1640. Such a difference may be based on different operational considerations such as limitations on local storage or transmission size restrictions.

In operation, when there is a miss on the device, the device 1610 sends a request to the server 1650, requesting a response for the on-going dialogue, with information related to the dialogue. When that happens, the server 1650 may identify an appropriate response and send the response to the device. Such a response identified by the server may be one of the server predicted responses 1370 along the server predicted path 1360. If the response cannot be found in the server predicted path/responses, the server may then search the overall dialogue tree to identify the response. With two levels (device and server) of cached predicted paths and responses, the time needed to generate a response is further reduced.

As shown in the configuration in FIG. 16, the device 1610 includes a local dialogue manager 1620 deployed to function locally to generate a response to a user 1600 by searching a local version 1640 of the predicted path 1360 and preemptively generated responses 1630, which are a local version of the predicted responses stored on the server 1650. If the local dialogue manager 1620 finds a response locally based on the predicted path 1640 and the predicted responses 1630, the device 1610 will provide the response to the user without requesting a response from the server 1650. In this configuration, when there is a miss, the device 1610 requests the server 1650 to provide a response. Upon receiving the request, the server 1650 may proceed to generate a response based on the server predicted path 1360 and the server predicted responses 1370. If the server predicted path 1360 and the server predicted responses 1370 are more extensive than the local predicted path 1640 and the corresponding local predicted responses 1630, a response not found in the local predicted path/responses may be included in the server predicted path/responses. Only if the server 1650 is unable to find a response in its predicted path 1360 and predicted responses 1370 does the server 1650 proceed to search for a response in the overall dialogue tree 750.

In addition to identifying a response for the device, the server 1650 may also generate an updated local predicted path 1640, the corresponding local predicted responses 1630, as well as an updated local dialogue manager 1620 that is operable with respect to the updated local predicted path/responses. The updated local predicted path/responses and the updated local dialogue manager may then be sent to the device for future operation. The updated local version of the predicted path and predicted responses may be generated based on either the overall dialogue tree 750 or the server predicted path 1360 and server predicted responses 1370. In some situations, the server cannot identify an appropriate response from the server predicted path 1360 and server predicted responses 1370; in this case, both the server and local versions of the predicted path/responses, as well as the local dialogue manager, need to be updated. If an appropriate response, although not found on the device 1610, is identified from the server predicted path/responses, the server predicted path/responses may not need to be updated.

As discussed herein, the updated local predicted path/responses may be generated by the server when a request for a response is received. In some situations, the updated local predicted path/responses may be generated from the existing server predicted path/responses. In other situations, the server predicted path/responses may also need to be updated, so that the updated local predicted path/responses are then generated based on the updated server predicted path/responses, which are generated based on the dialogue tree 750. In this case, the server generates both updated server versions and updated local versions of the predicted paths and predicted responses, i.e., the updates to the predicted path and predicted responses occur at both the server 1650 and the device 1610. Once the updated local predicted path/responses are generated, the updated local dialogue manager may then be generated accordingly. Once generated, the updated local dialogue information (including the updated local predicted path/responses and the updated local dialogue manager) is then sent from the server to the device so that it can be used to update the local dialogue manager 1620, the predicted path 1640, and the predicted responses 1630 on the device.

FIG. 17 depicts an exemplary high level system diagram of the device 1610, according to an embodiment of the present teaching. To realize the exemplary configuration shown in FIG. 16, the exemplary construct of the device 1610 comprises a dialogue state analyzer 1710, a response source determiner 1720, the local dialogue manager 1620, a predicted response retriever 1730, a response transmitter 1740, a device/server coordinator 1750, and a predicted path/response updater 1760. The device 1610 also includes the local predicted path 1640 and the local predicted responses 1630, both of which are used by the local dialogue manager 1620 to drive the dialogue between the device and a user. As discussed herein, the local predicted path 1640 and the local predicted responses 1630 may be updated by the predicted path/response updater 1760 based on the updated version of the local predicted path/responses received from the server 1650 via the device/server coordinator 1750.

FIG. 18 is a flowchart of an exemplary process of the device 1610, according to an embodiment of the present teaching. In operation, when the dialogue state analyzer 1710 receives, at 1810 of FIG. 18, information related to the on-going dialogue (which includes both the user's utterance as well as other information surrounding the dialogue), it determines, at 1820, the dialogue state of the dialogue. The surrounding information related to the dialogue may include multimodal information such as the audio of the user's utterance, visual information about the user such as the facial expression or gestures of the user, or other types of sensory data such as haptic information related to the user's movement. The dialogue state determined by the dialogue state analyzer 1710 based on the received surrounding information may include the content of the user's utterance, the emotional state of the user determined based on, e.g., the facial expression and/or tone of speech of the user, an estimated intent of the user, relevant object(s) in the dialogue environment, etc.

Based on the user utterance in the current dialogue state, the response source determiner 1720 determines whether a response to the user's utterance can be identified based on the locally stored predicted path 1640 and the locally stored predicted responses 1630. For example, at 1830, it is determined whether the local predicted path is relevant to the current utterance. The local predicted path may be relevant when it, e.g., includes a node that corresponds to the current utterance. If the local predicted path is relevant, it may be further checked, at 1840, whether the local predicted path includes a preemptively generated (predicted) response that can be used to respond to the user's utterance. If a preemptively generated response in the local predicted path is appropriate as a response to the user, the local dialogue manager 1620 is invoked to generate a response based on the locally stored predicted path 1640 and the locally stored predicted responses 1630. In this case, the local dialogue manager 1620 invokes the predicted response retriever 1730 to retrieve, at 1850, a preemptively generated response (e.g., according to the instruction of the local dialogue manager 1620) and forward the retrieved preemptively generated response to the response transmitter 1740 to transmit, at 1855, the locally identified response to the user. In this scenario, the device 1610 needs neither to request the server 1650 to provide a response (saving time) nor to communicate with the server 1650 (reducing traffic), so that it effectively enhances performance in terms of the needed computation, bandwidth, and latency.

If the local predicted path is not relevant to the current utterance, or an appropriate response to the user's utterance cannot be found in the local predicted responses, the device/server coordinator 1750 is invoked to communicate with the server 1650 for a response. To do so, the device/server coordinator 1750 sends, at 1860, a request for a response with information related to the dialogue state to the server 1650 and waits to receive a feedback. When the device/server coordinator 1750 receives the feedback from the server, the feedback may include the response sought, received at 1870, as well as an updated local predicted path with updated predicted responses and an accordingly generated updated local dialogue manager, received at 1880. With such received local dialogue information, the local dialogue information updater 1760 proceeds to update, at 1890, the local dialogue information, including the local predicted path 1640, the local predicted responses 1630, and the local dialogue manager 1620. The received response is then transmitted to the user at 1855 via the response transmitter 1740.

FIG. 19 depicts an exemplary high level system diagram of the server 1650, according to an embodiment of the present teaching. In this illustrated embodiment, the server 1650 comprises a dialogue state analyzer 1910, a response source determiner 1920, the dialogue manager 1340, a predicted response retriever 1930, a predicted path/responses generator 1960, a local dialogue manager generator 1950, and a response/local dialogue info transmitter 1940. FIG. 20 is a flowchart of an exemplary process of the server 1650, according to an embodiment of the present teaching. In operation, when the dialogue state analyzer 1910 receives, at 2005 of FIG. 20, a request for a response with associated dialogue state information from a device, it analyzes, at 2010, the received dialogue state and passes the information on to the response source determiner 1920 to determine where the response sought is to be identified. In some situations, the response may be found among the server predicted responses associated with the server predicted path 1360. In other situations, the response may need to be identified from the overall dialogue tree 750.

If the server predicted path 1360 exists, as determined at 2015, it is further determined, at 2020, whether a response to the current dialogue state can be found in the server predicted path 1360. If a response can be found in the server predicted path 1360, the predicted response retriever 1930 is invoked to retrieve, at 2025, the preemptively generated predicted response from 1370, and the retrieved response is sent to the response/path transmitter 1940 for transmission together with other updated dialogue information, including an updated local predicted path, updated predicted responses, and an updated local dialogue manager. If no appropriate server predicted path 1360 is available to generate a response (e.g., either there is no server predicted path or the existing server predicted path 1360 is not relevant with respect to the current dialogue state), or an appropriate response for the current dialogue state cannot be found in the server predicted path 1360, the response source determiner 1920 invokes the dialogue manager 1340 to generate, at 2030, a response with respect to the current dialogue state based on the overall dialogue tree 750.

As discussed herein, whenever the server is called upon to generate a response (i.e., there is a miss on the device), it indicates that the local predicted path and local predicted responses are no longer able to enable the local dialogue manager to drive the dialogue. Thus, in responding to the request to provide a response to the device, the server 1650 may also generate an updated local predicted path and updated predicted responses for the device. In addition, an updated local dialogue manager may also need to be generated accordingly, to be consistent with the updated local predicted path and responses. Such updated local dialogue related information may be generated by the server and sent to the device together with the generated response.

Furthermore, as there may also be a miss at the server with respect to the server predicted path 1360 and the server predicted responses 1370, the server predicted path and server predicted responses may also need to be updated when there is a miss at the server level. In this scenario, both the server and local versions of the predicted paths and responses may be re-generated and used to update the previous versions. Thus, it is determined, at 2035, whether the server predicted path and the server predicted responses need to be updated. If so, the predicted path/response generator 1960 is invoked to generate, at 2040 and 2045, the updated server predicted path and the updated server predicted responses, respectively. In this scenario, the updated server predicted path/responses are used to generate, at 2050, the updated local predicted path and the corresponding updated predicted responses.

If the server predicted path/responses do not need to be updated, as determined at 2035, the updated local predicted path and responses are then generated, at 2050, based on the current version of the server predicted path and server predicted responses. The updated local predicted path and updated local predicted responses are then used by the local dialogue manager generator 1950 to generate, at 2055, an updated local dialogue manager 1620, in accordance with the dialogue tree 750 and the dialogue manager 1340. The response generated by the server is then sent, at 2060, to the device together with the updated local dialogue information, including the updated local predicted path, the updated local predicted responses, and the updated local dialogue manager, so that they can be used by the local dialogue information updater 1760 (FIG. 17) to update the local predicted path 1640, the local predicted responses 1630, and the local dialogue manager 1620.

FIG. 21 depicts yet another exemplary operational configuration between a device and a server in managing a dialogue with a user, according to embodiments of the present teaching. In this illustrated embodiment, instead of retaining copies of the server predicted path and server predicted (preemptively generated) responses on the server, the server keeps a record of what has been dispatched to the device related to predicted paths/responses/local dialogue managers. In this configuration, as there is no server version of the predicted path and responses, whenever the server is requested to provide a response, the dialogue manager in the server will identify such a response directly from the overall dialogue tree. Based on such an identified response, the server then proceeds to generate updated local predicted path/responses and an updated local dialogue manager, which can then be transmitted, together with the response, to the device. The received updated local versions of the predicted path/responses/dialogue manager may then be used to replace the previous local dialogue manager 1620, the previous local predicted path 1640, and the previous local predicted responses 1630 in order to facilitate further local dialogue management on the device. This is shown in FIG. 21, where a server 2110 in this configuration includes a local dialogue information dispatch log 2120.

With this configuration, the device 1610 performs localized dialogue management based on the local predicted path 1640 and the corresponding local predicted (preemptively generated) responses 1630, both predicted by the server 2110 and deployed dynamically on the device 1610. The server 2110 may, upon receiving a request from a device and information related to the current dialogue state, identify a response that the device was not able to find in the previously deployed predicted path and then preemptively generate a predicted dialogue path and predicted responses based on the received information. In this embodiment, the server 2110 may not maintain predicted dialogue paths for different devices and operate based on them. Rather, such predicted dialogue paths and responses are transmitted to individual devices to enable them to accordingly manage their own local dialogues. In this configuration, the server may retain the information in a dispatch log 2120 that records the local predicted dialogue paths and the preemptively generated responses associated therewith that have been transmitted to different devices. In some embodiments, such logged information may be used in generating corresponding updated local predicted paths and preemptively generated responses when the previous version can no longer be used to drive the dialogue.

FIG. 22 depicts an exemplary high level system diagram of the server 2110, according to an embodiment of the present teaching. As illustrated, the server 2110 comprises a dialogue state analyzer 2210, the dialogue manager 1340, a local predicted path/response generator 2220, a local dialogue manager generator 2230, a local dialogue information transmitter 2240, and a dispatch record updater 2250. FIG. 23 is a flowchart of an exemplary process of the server 2110, according to an embodiment of the present teaching. In operation, when a request is received, at 2310, by the dialogue state analyzer 2210, it is analyzed and then used by the dialogue manager 1340 to generate, at 2320, a response based on the dialogue state and the overall dialogue tree 750. As discussed herein, the dialogue state may include the utterance of the user operating the device and other information surrounding the dialogue, such as the facial expression, the estimated emotional state of the user, the intent of the user, and relevant objects and characterizations thereof in the dialogue scene. In some embodiments, when the dialogue manager 1340 generates the response, it may also consider the information surrounding the dialogue, such as the emotional state of the user and/or profile information of the user, such as what the user likes. For example, if the user's utterance is not responsive and is accompanied by a negative emotional state, the dialogue manager 1340 may identify a response that is driven more by the profile of the user instead of following the set path in the dialogue tree 750. For instance, if the user's utterance is not quite relevant to the dialogue and the user appears to be frustrated, the dialogue manager 1340 may select a response that is driven more by the preferences of the user rather than by the dialogue tree 750. If the user likes basketball and there is a basketball in the dialogue scene, the dialogue manager 1340 may decide to talk to the user about basketball to refocus the user before continuing on the initial topic of the dialogue.
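
The sketch below (Python) illustrates, under assumed inputs, the profile-driven fallback just described: when the utterance is off-topic and the user appears frustrated, a response built from the user's known interests is chosen instead of the next step in the dialogue tree.

# A sketch of falling back to a profile-driven response; inputs are assumed.
def choose_response(tree_response: str, utterance_relevant: bool,
                    emotion: str, user_interests: list) -> str:
    if (not utterance_relevant or emotion in {"frustrated", "angry"}) and user_interests:
        topic = user_interests[0]     # e.g., "basketball"
        return f"I noticed you like {topic}. Want to talk about that for a moment?"
    return tree_response              # otherwise follow the dialogue tree 750

# Example: a frustrated, off-topic user who likes basketball gets refocused.
reply = choose_response("Let's continue with triangles.", False, "frustrated",
                        ["basketball"])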

Such a generated response is then used by the local predicted path/response generator 2220 to generate, at 2330, the updated local predicted path and the updated local responses. The generation of such updated local dialogue information may be based not only on the response but also on additional information from the dialogue state and/or the profile of the user. In this manner, the updated local predicted path and responses are consistent with the response the dialogue manager 1340 generated, the current dialogue state, and/or the user's preferences. Based on the updated local predicted path and responses, an updated local dialogue manager is generated, at 2340, by the local dialogue manager generator 2230. The updated local dialogue information (the local predicted path, the local predicted responses, and the local dialogue manager) is then sent to the local dialogue information transmitter 2240, which then transmits, at 2350, such information to the device 1610 so that the local predicted path, the local predicted responses, and the local dialogue manager may be replaced with the updated versions to drive the future dialogue locally on the device 1610. The dispatch record updater 2250 then updates, at 2360, the dialogue information dispatch log 2120.

FIG. 24 is an illustrative diagram of an exemplary mobile device architecture that may be used to realize a specialized system implementing at least some parts of the present teaching in accordance with various embodiments. In this example, the user device on which the present teaching is implemented corresponds to a mobile device 2400, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or any other form factor. Mobile device 2400 may include one or more central processing units ("CPUs") 2440, one or more graphic processing units ("GPUs") 2430, a display 2420, a memory 2460, a communication platform 2410, such as a wireless communication module, storage 2490, and one or more input/output (I/O) devices 2440. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 2400. As shown in FIG. 24, a mobile operating system 2470 (e.g., iOS, Android, Windows Phone, etc.) and one or more applications 2480 may be loaded into memory 2460 from storage 2490 in order to be executed by the CPU 2440. The applications 2480 may include a browser or any other suitable mobile apps for managing a conversation system on mobile device 2400. User interactions may be achieved via the I/O devices 2440 and provided to the automated dialogue companion via network(s) 120.

To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems, and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies to the appropriate settings as described herein. A computer with user interface elements may be used to implement a personal computer (PC) or another type of work station or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment and, as a result, the drawings should be self-explanatory.

FIG. 25 is an illustrative diagram of an exemplary computing device architecture that may be used to realize a specialized system implementing at least some parts of the present teaching in accordance with various embodiments. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform, which includes user interface elements. The computer may be a general purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. This computer 2500 may be used to implement any component of the conversation or dialogue management system, as described herein. For example, the conversation management system may be implemented on a computer such as computer 2500 via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the conversation management system as described herein may be implemented in a distributed fashion on a number of similar platforms to distribute the processing load.

Computer 2500, for example, includes COM ports 2550 connected to and from a network connected thereto to facilitate data communications. Computer 2500 also includes a central processing unit (CPU) 2520, in the form of one or more processors, for executing program instructions. The exemplary computer platform includes an internal communication bus 2510, program storage and data storage of different forms (e.g., disk 2570, read only memory (ROM) 2530, or random access memory (RAM) 2540), for various data files to be processed and/or communicated by computer 2500, as well as possibly program instructions to be executed by CPU 2520. Computer 2500 also includes an I/O component 2560, supporting input/output flows between the computer and other components therein such as user interface elements 2580. Computer 2500 may also receive programming and data via network communications.

Hence, aspects of the methods of dialogue management and/or other processes, as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.

All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, in connection with conversation management. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server. In addition, the dialogue management techniques as disclosed herein may be implemented as a firmware, firmware/software combination, firmware/hardware combination, or a hardware/firmware/software combination.

While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

We claim:
1. A method implemented on at least one machine including at least one processor, memory, and communication platform capable of connecting to a network for managing a user machine dialogue, the method comprising:
receiving, at a device, sensor data including an utterance representing a speech of a user engaged in a dialogue with the device;
determining the speech of the user based on the utterance;
searching, by a local dialogue manager residing on the device, a sub-dialogue tree stored on the device for a response to the user based on the speech;
rendering the response to the user in response to the speech, if the response is identified from the sub-dialogue tree; and
sending, if the response is not available in the sub-dialogue tree, a request to a server for the response.
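
Purely for illustration and not as part of the claims, the device-side flow recited above may be sketched in Python as below. The sub-dialogue tree representation, the helper callables (for speech recognition, rendering, and the server request), and all names are assumptions of this sketch rather than any required implementation.

    from typing import Callable, Dict, Optional

    # Illustrative sketch only; the tree representation and helper names
    # are assumptions, not the claimed implementation.

    class LocalDialogueManager:
        def __init__(self, sub_dialogue_tree: Dict[str, str]):
            # The cached sub-dialogue tree maps a recognized user speech to
            # a pre-computed response along the locally predicted path.
            self.sub_dialogue_tree = sub_dialogue_tree

        def search(self, speech: str) -> Optional[str]:
            # Search the locally stored sub-dialogue tree for a response.
            return self.sub_dialogue_tree.get(speech)

    def handle_utterance(utterance: bytes,
                         local_dm: LocalDialogueManager,
                         recognize_speech: Callable[[bytes], str],
                         render: Callable[[str], None],
                         request_from_server: Callable[[str], str]) -> str:
        # Determine the speech of the user based on the received utterance.
        speech = recognize_speech(utterance)

        # Search, by the local dialogue manager residing on the device, the
        # on-device sub-dialogue tree for a response based on the speech.
        response = local_dm.search(speech)

        if response is None:
            # Response not available locally: send a request to the server.
            response = request_from_server(speech)

        # Render the response to the user in response to the speech.
        render(response)
        return response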